# Multimodal contrastive learning
| Model | License | Description | Task | Publisher | Downloads | Likes |
| --- | --- | --- | --- | --- | --- | --- |
| Eva02 Large Patch14 Clip 224.merged2b | MIT | EVA-CLIP vision-language model built on OpenCLIP with timm model weights, supporting zero-shot image classification. | Image Classification | timm | 165 | 0 |
| Eva02 Enormous Patch14 Clip 224.laion2b | MIT | EVA-CLIP vision-language model based on the CLIP architecture, supporting zero-shot image classification. | Text-to-Image | timm | 38 | 0 |
| Vit H 14 CLIPA Datacomp1b | Apache-2.0 | CLIPA-v2 model, an efficient contrastive vision-language model designed for zero-shot image classification. | Text-to-Image | UCSC-VLAA | 65 | 1 |
| Vit H 14 CLIPA 336 Laion2b | Apache-2.0 | CLIPA-v2 model trained on the laion2B-en dataset, focused on zero-shot image classification. | Text-to-Image | UCSC-VLAA | 74 | 4 |
| Vit B 16 SigLIP | Apache-2.0 | SigLIP (Sigmoid Loss for Language-Image Pre-training) model trained on the WebLI dataset for zero-shot image classification. | Text-to-Image | timm | 27.77k | 31 |
| CLIP ViT B 32 Laion2b E16 | MIT | Vision-language pretrained model implemented with OpenCLIP, supporting zero-shot image classification. | Text-to-Image | justram | 89 | 0 |
| CLIP ViT L 14 CommonPool.XL S13b B90k | MIT | Vision-language pretrained model based on the CLIP architecture, supporting zero-shot image classification and cross-modal retrieval. | Text-to-Image | laion | 4,255 | 2 |
| CLIP ViT B 32 CommonPool.M.clip S128m B4k | MIT | Zero-shot image classification model based on the CLIP architecture, trained on a CLIP-filtered CommonPool subset. | Image-to-Text | laion | 164 | 0 |
| CLIP ViT B 32 CommonPool.S.laion S13m B4k | MIT | Vision-language model based on the CLIP architecture, supporting zero-shot image classification. | Text-to-Image | laion | 58 | 0 |
| CLIP ViT B 32 CommonPool.S.image S13m B4k | MIT | Vision-language model based on the CLIP architecture, supporting zero-shot image classification. | Text-to-Image | laion | 60 | 0 |
| Eva02 Large Patch14 Clip 224.merged2b S4b B131k | MIT | EVA02 large-scale vision-language model based on the CLIP architecture, supporting zero-shot image classification. | Image Classification | timm | 5,696 | 6 |
| Vit Large Patch14 Clip 336.openai | Apache-2.0 | CLIP model developed by OpenAI with a ViT-L/14 architecture, supporting zero-shot image classification. | Text-to-Image | timm | 35.62k | 2 |
| CLIP ViT G 14 Laion2b S34b B88k | MIT | CLIP ViT-g/14 model trained on the LAION-2B dataset, supporting zero-shot image classification and image-text retrieval. | Text-to-Image | laion | 76.65k | 24 |
| Xclip Base Patch16 Zero Shot | MIT | X-CLIP is a minimal extension of CLIP for general video-language understanding, trained contrastively on (video, text) pairs; suited to zero-shot, few-shot, and fully supervised video classification as well as video-text retrieval. | Text-to-Video (Transformers, English) | microsoft | 5,045 | 24 |
| Clip Vit Base Patch32 | | CLIP is a multimodal model developed by OpenAI that relates images and text, supporting zero-shot image classification. | Image-to-Text | openai | 14.0M | 666 |
| Clip Italian | GPL-3.0 | The first contrastive language-image pretraining model for Italian, built on Italian BERT and a ViT image encoder; reaches competitive performance with only 1.4 million fine-tuning examples. | Text-to-Image (Italian) | clip-italian | 960 | 16 |
| Clip Vit Large Patch14 | | CLIP is a vision-language model developed by OpenAI that maps images and text into a shared embedding space through contrastive learning, supporting zero-shot image classification. | Image-to-Text | openai | 44.7M | 1,710 |
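
Most of the image checkpoints above (the timm, laion, UCSC-VLAA, and justram entries) are distributed in OpenCLIP format, so zero-shot classification follows the same recipe regardless of backbone: encode the image and a set of label prompts, normalize both embeddings, and rank labels by cosine similarity. The sketch below assumes the `open_clip_torch` package and uses the `ViT-B-32` / `laion2b_e16` weights as a stand-in; other model/pretrained tag pairs from the table should be checked against `open_clip.list_pretrained()`.

```python
# Minimal zero-shot classification sketch with OpenCLIP (assumes open_clip_torch is installed).
import torch
import open_clip
from PIL import Image

# Model and pretrained-tag names are examples; verify with open_clip.list_pretrained().
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_e16")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical input image
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so the dot product becomes cosine similarity.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```

The SigLIP entries are trained with a sigmoid rather than a softmax contrastive loss, but the same encode-and-compare flow applies at inference time.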
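For the video entry (Xclip Base Patch16 Zero Shot), the same contrastive idea applies to clips of frames rather than single images. Below is a hedged sketch using the Hugging Face Transformers `AutoProcessor`/`AutoModel` interface; the random frames, label strings, and the 32-frame clip length are placeholder assumptions that should be checked against the checkpoint's model card.

```python
# Zero-shot video classification sketch with X-CLIP (assumes transformers and torch are installed).
import numpy as np
import torch
from transformers import AutoModel, AutoProcessor

checkpoint = "microsoft/xclip-base-patch16-zero-shot"
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model.eval()

# Placeholder clip: a list of RGB frames (H, W, 3). Real use would sample frames from a
# video; the expected frame count is an assumption to verify against the model card.
video = list(np.random.randint(0, 256, (32, 224, 224, 3), dtype=np.uint8))
labels = ["playing guitar", "cooking", "dancing"]

inputs = processor(text=labels, videos=video, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_video scores the clip against each candidate label.
probs = outputs.logits_per_video.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```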